Abstract

Social media is a rich source of rumours and corresponding community
reactions. Rumours reflect different characteristics, some shared and
some individual. We formulate the problem of classifying tweet-level
judgements of rumours as a supervised learning task. Both supervised
and unsupervised domain adaptation are considered, in which tweets
from a rumour are classified on the basis of other annotated rumours.
We demonstrate how multi-task learning helps achieve good results on
rumours from the 2011 England riots.

1 Introduction 


text                                                    position
------------------------------------------------------  --------
Birmingham Children’s hospital has been attacked.       support
F***ing morons. #UKRiots

Girlfriend has just called her ward in Birmingham       deny
Children’s Hospital & there’s no sign of any trouble
#Birminghamriots

Birmingham children’s hospital guarded by police?       question
Really? Who would target a childrens hospital
#disgusting #Birminghamriots

Table 1: Tweets on a rumour about a hospital being attacked during
the 2011 England riots.


There is an increasing need to interpret and act upon rumours
spreading quickly through social media, especially in circumstances
where their veracity is hard to establish. For instance, during an
earthquake in Chile rumours spread through Twitter that a volcano had
become active and that there was a tsunami warning in Valparaiso
(Mendoza et al., 2010). Other examples, from the riots in England in
2011, were that rioters were going to attack Birmingham Children’s
Hospital and that animals had escaped from the zoo
(Procter et al., 2013).

Social scientists (Procter et al., 2013) manually analysed a sample of
tweets expressing different judgements towards rumours and categorised
them as supporting, denying or questioning. The goal here is to carry
out tweet-level judgement classification automatically, in order to
assist in (near) real-time rumour monitoring by journalists and
authorities (Procter et al., 2013). In addition, information about
tweet-level judgements has been used as a first step for early rumour
detection by Zhao et al. (2015).

The focus here is on tweet-level judgement classification on unseen
rumours, based on a training set of other already annotated rumours.
Previous work on this problem either considered unrealistic settings
ignoring temporal ordering and rumour identities
(Qazvinian et al., 2011) or proposed regular expressions as a solution
(Zhao et al., 2015). We expect posts expressing similar opinions to
exhibit many similar characteristics across different rumours. Based
on the assumption of a common underlying linguistic signal, we build a
transfer learning system that labels newly emerging rumours for which
we have little or no annotated data. Results demonstrate that Gaussian
Process-based multi-task learning allows for significantly improved
performance.

The novel contributions of this paper are:

1. Formulating the problem of classifying judgements of rumours in
   both supervised and unsupervised domain adaptation settings.
2. Showing how a multi-task learning approach outperforms single-task
   methods.


2 Related work 

In the context of rumour spread in social media, researchers have
studied differences in information flows between content of varying
credibility. For instance, Procter et al. (2013) grouped source tweets
and re-tweets into information flows (Lotan et al., 2011), then ranked
these by flow size, as a proxy of significance. Information flows were
then categorised manually. Along similar lines, Mendoza et al. (2010)
found that users deal with true and false rumours differently: the
former are affirmed more than 90% of the time, whereas the latter are
challenged (questioned or denied) 50% of the time.
Friggeri et al. (2014) analysed a set of rumours from the Snopes.com
website that have been matched to Facebook public conversations. They
concluded that false rumours are more likely to receive a comment with
a link to the Snopes.com website. However, none of the above attempted
to automatically classify rumours.


With respect to automatic methods for detecting misinformation and
disinformation in social media, Ratkiewicz et al. (2011) detect
political abuse (a kind of disinformation) spread through Twitter. The
task is defined in purely information diffusion settings and is not
necessarily related to the truthfulness of the piece of information.
Castillo et al. (2013) proposed methods for identifying newsworthy
information cascades on Twitter and then classifying these cascades as
credible and not credible. The main differences from our task are that
credibility classification is carried out over the entire information
cascade, that the classified objects are not necessarily rumours, and
that no explicit judgement classification was performed in their
approach.


Early rumour identification is the focus of Zhao et al. (2015), where
regular expressions are used for finding questioning and denying
tweets as a key prerequisite step for rumour detection. Unfortunately,
when we applied these regular expressions to our dataset, they yielded
only 16% recall for questioning and 14% recall for denying tweets.
This motivated us to seek a better approach to tweet-level
classification.


The work most relevant to ours is due to Qazvinian et al. (2011).
Their method first carries out rumour retrieval, whereby tweets are
classified into rumour-related and non-rumour-related. Next,
rumour-related tweets are classified into supporting and
not-supporting. The classifier is trained by ignoring rumour
identities, i.e., pooling together tweets from all rumours, and
ignoring the temporal dependencies between tweets. In contrast, we
formulate the rumour classification problem as transfer learning,
where unseen rumours (or rumours with few initial tweets observed) are
classified using already known rumours, a much harder and more
practical setting. Moreover, unlike Qazvinian et al. (2011), we
consider the multi-class classification problem and do not collapse
questioning and denying tweets into a single class, since they differ
significantly.

Rumour             Supporting   Denying   Questioning
----------------   ----------   -------   -----------
army bank                  62        42            73
hospital                  796       487           132
London Eye                177       295           160
McDonald’s                177         0            13
Miss Selfridge’s         3150         0             7
police beat girl          783         4            95
zoo                       616       129            99

Table 2: Counts of tweets with supporting, denying or questioning
labels in each rumour collection.


3 Data 


We evaluate our work on several rumours circulating on Twitter during
the England riots in 2011 (see Table 2). The dataset was analysed and
annotated manually as supporting, questioning, or denying a rumour, by
a team of social scientists studying the role of social media during
the riots (Procter et al., 2013). The original dataset also included
commenting tweets, but these have been removed from our experiments
due to their small number (they constituted only 5% of the corpus).

As can be seen from the dataset overview in Table 2, different rumours
exhibit varying proportions of supporting, denying and questioning
tweets, which was also observed in other studies of rumours
(Mendoza et al., 2010; Qazvinian et al., 2011). These variations in
majority classes across rumours underscore the modelling challenge in
tweet-level classification of rumour attitudes.


With respect to veracity, one rumour has been confirmed as true (Miss
Selfridge’s being on fire), one is unsubstantiated (police beat girl),
and the remaining five are known to be false. Note, however, that the
focus here is not on classifying truthfulness, but instead on
identifying the attitude expressed in each tweet towards the rumour.



































4 Problem formulation 

Let $R$ be a set of rumours, each of which consists of tweets
discussing it, $\forall_{r \in R}\; T_r = \{t^r_1, \ldots, t^r_{r_n}\}$.
$T = \cup_{r \in R} T_r$ is the complete set of tweets from all
rumours. Each tweet is classified as supporting, denying or
questioning with respect to its rumour: $y(t) \in \{0, 1, 2\}$, where
0 denotes supporting, 1 denotes denying and 2 denotes questioning.

First, we consider the Leave One Out (LOO) setting, which means that
for each rumour $r \in R$, we construct the test set equal to $T_r$
and the training set equal to $T \setminus T_r$. This is a very
challenging and realistic scenario, where the test set contains a
rumour that is entirely unseen in the training set.

The second setting is Leave Part Out (LPO). In this formulation, a
very small number of initial tweets from the target rumour,
$\{t^r_1, \ldots, t^r_k\}$, is added to the training set. This
scenario typically becomes applicable soon after a rumour breaks out
and journalists have started monitoring and analysing the related
tweet stream. The experiments section investigates how the number of
initial training tweets influences classification performance on a
fixed test set, namely $\{t^r_l, \ldots, t^r_{r_n}\}$, $l > k$.
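The two settings can be illustrated with a short Python sketch. It is not the authors' code: the function names, the dictionary layout (rumour name mapped to a chronologically ordered list of tweets) and the toy data are assumptions made for illustration only.

```python
def loo_split(tweets_by_rumour, target):
    """Leave One Out: train on T \\ T_r, test on T_r (the target rumour)."""
    train = [t for rumour, tweets in tweets_by_rumour.items()
             if rumour != target for t in tweets]
    test = list(tweets_by_rumour[target])
    return train, test


def lpo_split(tweets_by_rumour, target, k, l):
    """Leave Part Out: add the first k target tweets to the training set and
    test on tweets t_l, ..., t_{r_n} (1-based positions, with l > k)."""
    if l <= k:
        raise ValueError("the test set must start after the training tweets (l > k)")
    reference, _ = loo_split(tweets_by_rumour, target)
    target_tweets = tweets_by_rumour[target]  # assumed to be in temporal order
    train = reference + target_tweets[:k]
    test = target_tweets[l - 1:]
    return train, test


# Hypothetical toy data: two rumours with placeholder tweet ids.
data = {"hospital": ["h1", "h2", "h3", "h4"], "zoo": ["z1", "z2"]}
loo_train, loo_test = loo_split(data, "hospital")
lpo_train, lpo_test = lpo_split(data, "hospital", k=1, l=3)
```

Keeping `l` fixed while varying `k` mirrors the idea of measuring the effect of more target training tweets on one unchanged test set.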

The tweet-level classification problem here assumes that tweets from
the training set are already labelled with the rumour discussed and
the attitude expressed towards it. This information can be acquired
either via manual annotation as part of expert analysis, as is the
case with our dataset, or automatically, e.g. using pattern-based
rumour detection (Zhao et al., 2015). Afterwards, our method can be
used to classify the attitudes expressed in each new tweet from
outside the training set.

5 Gaussian Processes for Classification 


The central concept of Gaussian Process Classification (GPC;
(Rasmussen and Williams, 2005)) is a latent function $f$ over inputs
$\mathbf{x}$: $f(\mathbf{x}) \sim \mathcal{GP}(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}'))$,
where $m$ is the mean function, assumed to be 0, and $k$ is the kernel
function, specifying the degree to which the outputs covary as a
function of the inputs. We use a linear kernel,
$k(\mathbf{x}, \mathbf{x}') = \sigma^2 \mathbf{x}^\top \mathbf{x}'$.
The latent function is then mapped by the probit function $\Phi(f)$
into the range $[0, 1]$, such that the resulting value can be
interpreted as $p(y = 1|\mathbf{x})$.

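The GPC posterior is calculated as

$$p(f_*|X, \mathbf{y}, \mathbf{x}_*) = \int p(f_*|X, \mathbf{x}_*, \mathbf{f}) \, \frac{p(\mathbf{y}|\mathbf{f}) \, p(\mathbf{f})}{p(\mathbf{y}|X)} \, d\mathbf{f},$$

where $p(\mathbf{y}|\mathbf{f}) = \prod_{j=1}^{n} \Phi(f_j)^{y_j} (1 - \Phi(f_j))^{1 - y_j}$
is the Bernoulli likelihood of class $y$ over the $n$ training tweets.
After calculating the above posterior from the training data, this is
used in prediction, i.e.,

$$p(y_* = 1|X, \mathbf{y}, \mathbf{x}_*) = \int \Phi(f_*) \, p(f_*|X, \mathbf{y}, \mathbf{x}_*) \, df_*.$$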

The above integrals are intractable and approximation techniques are
required to solve them. There exist various methods to deal with
calculating the posterior; here we use Expectation Propagation (EP;
(Minka and Lafferty, 2002)). In EP, the posterior is approximated by a
fully factorised distribution, where each component is assumed to be
an unnormalised Gaussian.
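As a small illustration of the prediction step: if the approximation yields a Gaussian over the latent test value, $f_* \sim \mathcal{N}(\mu, s^2)$, the probit predictive integral has the standard closed form $\Phi(\mu / \sqrt{1 + s^2})$. The Python sketch below assumes that such a mean and variance are already given (it does not implement EP itself) and checks the closed form against a simple numerical integration. All function names are illustrative.

```python
import math


def probit(z):
    """Standard normal CDF, Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def predictive_prob(mean, var):
    """Closed form of the integral of Phi(f) * N(f | mean, var) over f."""
    return probit(mean / math.sqrt(1.0 + var))


def predictive_prob_numeric(mean, var, steps=20001, width=10.0):
    """The same integral via the trapezoidal rule, as a sanity check."""
    sd = math.sqrt(var)
    lo = mean - width * sd
    h = 2.0 * width * sd / (steps - 1)
    total = 0.0
    for i in range(steps):
        f = lo + i * h
        density = math.exp(-0.5 * ((f - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))
        weight = 0.5 if i in (0, steps - 1) else 1.0
        total += weight * probit(f) * density
    return total * h


closed = predictive_prob(0.7, 2.0)
numeric = predictive_prob_numeric(0.7, 2.0)
```

A larger latent variance pulls the predicted probability towards 0.5, which is how uncertainty about $f_*$ softens the class prediction.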

In order to conduct multi-class classification, we perform a
one-vs-all classification for each label and then assign the one with
the highest likelihood amongst the three (supporting, denying,
questioning). We choose this method due to interpretability of
results, similar to recent work on occupational class classification
(Preotiuc-Pietro et al., 2015).
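The one-vs-all decision rule can be sketched in a few lines of Python. The sketch assumes that three binary classifiers have already produced $p(y = 1|\mathbf{x})$ for their respective labels; the function name and the example probabilities are hypothetical.

```python
LABELS = ("supporting", "denying", "questioning")


def one_vs_all_predict(prob_by_label):
    """Return the label whose binary classifier reports the highest probability."""
    missing = [label for label in LABELS if label not in prob_by_label]
    if missing:
        raise KeyError("missing probabilities for: " + ", ".join(missing))
    return max(LABELS, key=lambda label: prob_by_label[label])


# Hypothetical outputs of the three binary classifiers for one tweet.
prediction = one_vs_all_predict(
    {"supporting": 0.40, "denying": 0.70, "questioning": 0.20}
)
```

The per-label probabilities need not sum to one, since each comes from a separately trained binary model.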


Gaussian Processes are a Bayesian non-parametric machine learning
framework that has been shown to work well for a range of NLP
problems, often beating other state-of-the-art methods
(Cohn and Specia, 2013; Lampos et al., 2014; Beck et al., 2014;
Preotiuc-Pietro et al., 2015). We use Gaussian Processes as this
probabilistic kernelised framework avoids the need for expensive
cross-validation for hyperparameter selection.^1

^1 There exist frequentist kernel methods, like SVMs, which
additionally require extensive held-out parameter tuning.

Intrinsic Coregionalization Model In the LPO setting, initial labelled
tweets from the target rumour are observed as well. In this case, we
propose to weight the importance of tweets from the reference rumours
depending on how similar their characteristics are to the tweets from
the target rumour available for training. To handle this with GPC, we
use a multiple output model based on the Intrinsic Coregionalization
Model (ICM; (Alvarez et al., 2012)). It has already been applied
successfully to NLP regression problems (Beck et al., 2014) and it can
also be applied to classification ones. ICM parametrizes the kernel by
a matrix which represents the extent of covariance between pairs of
tasks. The complete kernel takes the form

$$k((\mathbf{x}, d), (\mathbf{x}', d')) = k_{data}(\mathbf{x}, \mathbf{x}') \, B_{d,d'},$$

where $B$ is a square coregionalization matrix, $d$ and $d'$ denote
the tasks of the two inputs and $k_{data}$ is a kernel for comparing
inputs $\mathbf{x}$ and $\mathbf{x}'$ (here, linear). We parametrize
the coregionalization matrix as $B = \kappa I + \mathbf{v}\mathbf{v}^\top$,
where $\mathbf{v}$ specifies the correlation between tasks and the
vector $\kappa$ controls the extent of task independence.

method           acc
--------------   ----
Majority         0.68
GPPooled Brown   0.72
GPPooled BOW     0.69

Hyperparameter selection We tune hyperparameters $\mathbf{v}$,
$\kappa$ and $\sigma^2$ by maximizing the evidence of the model
$p(\mathbf{y}|X)$, thus having no need for a validation set.
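The ICM kernel described above can be made concrete with a small pure-Python sketch. It is only an illustration under stated assumptions: $\kappa$ is treated as a vector with one entry per task, so that $B = \mathrm{diag}(\kappa) + \mathbf{v}\mathbf{v}^\top$ (a scalar $\kappa$ corresponds to equal entries); tasks are indexed by integers; and all hyperparameter values are made up rather than learned by maximizing the evidence.

```python
def linear_kernel(x, x_prime, variance=1.0):
    """k_data(x, x') = sigma^2 * x^T x'."""
    return variance * sum(a * b for a, b in zip(x, x_prime))


def coregionalization_matrix(v, kappa):
    """B = diag(kappa) + v v^T, with one row and column per task (rumour)."""
    n = len(v)
    return [[v[i] * v[j] + (kappa[i] if i == j else 0.0) for j in range(n)]
            for i in range(n)]


def icm_kernel(x, d, x_prime, d_prime, B, variance=1.0):
    """k((x, d), (x', d')) = k_data(x, x') * B[d][d']."""
    return linear_kernel(x, x_prime, variance) * B[d][d_prime]


# Made-up hyperparameters for two tasks (e.g. one reference, one target rumour).
B = coregionalization_matrix(v=[1.0, 0.5], kappa=[0.1, 0.2])
same_task = icm_kernel([1.0, 2.0], 0, [3.0, 4.0], 0, B)
cross_task = icm_kernel([1.0, 2.0], 0, [3.0, 4.0], 1, B)
```

With $\mathbf{v} = \mathbf{0}$ the off-diagonal entries of $B$ vanish and tweets from different rumours do not influence each other, whereas large entries in $\mathbf{v}$ relative to $\kappa$ move the model towards pooling all rumours.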

Methods We consider GPs in three different settings, varying in what
data the model is trained on and what kernel it uses. The first
setting (denoted GP) considers only target rumour data for training.
The second (GPPooled) additionally considers tweets from reference
rumours (i.e. other than the target rumour). The third setting is
GPICM, where an ICM kernel is used to weight the influence of tweets
from reference rumours.


6 Features 


We conducted a series of preprocessing steps in order to address data
sparsity. All words were lowercased; stop words were removed; all
emoticons were replaced with words^2; and stemming was performed. In
addition, multiple occurrences of a character were replaced with a
double occurrence (Agarwal et al., 2011), to correct for misspellings
and lengthenings, e.g., looool. All punctuation was also removed,
except for ., ! and ?, which we hypothesize to be important for
expressing emotion. Lastly, usernames were removed as they tend to be
rumour-specific, i.e., very few users comment on more than one rumour.
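The preprocessing steps above can be approximated by the following Python sketch. It is a simplified illustration rather than the original pipeline: the stop word list and emoticon dictionary are tiny hypothetical stand-ins, stemming is omitted because it needs an external stemmer, the order of the steps is a choice made here, and the kept punctuation is read as ".", "!" and "?".

```python
import re

STOP_WORDS = {"the", "a", "is", "has", "been"}  # hypothetical, deliberately tiny
EMOTICONS = {":)": "smile", ":(": "sad"}        # hypothetical stand-in dictionary
KEEP_PUNCT = ".!?"                              # assumed reading of the kept punctuation


def preprocess(tweet):
    """Return a list of normalised tokens for one tweet."""
    text = tweet
    for emoticon, word in EMOTICONS.items():    # emoticons -> words
        text = text.replace(emoticon, " " + word + " ")
    text = text.lower()
    text = re.sub(r"@\w+", " ", text)           # drop usernames
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)  # looool -> lool
    kept = re.escape(KEEP_PUNCT)
    text = re.sub(r"[^\w\s" + kept + r"]", " ", text)  # drop other punctuation
    text = re.sub(r"([" + kept + r"])", r" \1 ", text)  # kept marks become tokens
    return [token for token in text.split() if token not in STOP_WORDS]


tokens = preprocess("@bob The hospital is SAFE!!! looool :)")
```

The resulting tokens could then be counted into a bag of words, or mapped to Brown cluster ids given a word-to-cluster lookup table.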

After preprocessing the text data, we use either the resulting bag of
words (BOW) feature representation or replace all words with their
Brown cluster ids (Brown), using 1000 clusters acquired from a large
scale Twitter corpus (Owoputi et al., 2013). In all cases, simple
re-tweets are removed from the training set to prevent bias
(Llewellyn et al., 2014).

^2 We used the dictionary from: http://bit.ly/lrXlHdk and extended it
with: :o, ; |, =/, :s, :S, :p.

Table 3: Accuracy taken across all rumours in the LOO setting.


7 Experiments and Discussion 

Table 3 shows the mean accuracy in the LOO scenario following the
GPPooled method, which pools all reference rumours together ignoring
their task identities. ICM cannot exploit correlations with the target
rumour in this case, since no target rumour tweets are available for
training, and so cannot be used. The majority baseline simply assigns
the most frequent class from the training set.

We can observe that the GP methods perform at a level similar to the
majority baseline, outperforming it only slightly (0.69 to 0.72
accuracy against 0.68). This indicates how difficult the LOO task is
when no annotated target rumour tweets are available.

Figure 1 shows accuracy for a range of methods as the number of tweets
about the target rumour used for training increases. Most notably,
performance increases from 70% to around 80% after only 10 annotated
tweets from the target rumour become available, as compared to the
results on unseen rumours from Table 3. However, as the amount of
target rumour data increases further, performance does not improve,
which suggests that as few as 10 human-annotated tweets are enough to
achieve significant performance benefits. Note also how important the
use of reference rumours is, as methods using only the target rumour
(GP Brown and GP BOW) obtain accuracy similar to the Majority vote
classifier.

The top performing methods are GPICM and GPPooled, where use of Brown
clusters consistently improves results for both methods over BOW,
irrespective of the number of tweets about the target rumour annotated
for training. Moreover, GPICM is better than GPPooled both with Brown
and BOW features, and GPICM with Brown is ultimately the best
performing of all.

In order to analyse the importance of Brown clusters, Automatic
Relevance Determination (ARD) is used (Rasmussen and Williams, 2005)
for the best performing GPICM Brown in the LPO scenario. Only the case
where the first 10 tweets are used for training is considered, since
it already performs very well. Using ARD, we learn a separate
length-scale for each feature, thus establishing their importance. The
weights learnt for different clusters are averaged over the 7 rumours
and the top 5 Brown clusters for each label are shown in Table 4. We
can see that clusters around the words fake and bullshit turn out to
be important for the denying class, and true for both supporting and
questioning classes. This reinforces our hypothesis that common
linguistic cues can be found across multiple rumours. Note how
punctuation proves important as well, since clusters ? and ! are also
very prominent.

[Figure 1 plot. Legend: Majority, GPICM Brown, GPICM BOW, GPPooled
Brown, GPPooled BOW, GP Brown, GP BOW.]

Figure 1: Accuracy measures for different methods versus the size of
the target rumour used for training in the LPO setting. The test set
is fixed to all but the first 50 tweets of the target rumour.

supporting             denying                     questioning
--------------------   -------------------------   ----------------------
?     10001101         fake      11111000001       ?          10001101
!     10001100         not       001000            !          10001100
not   001000           ?         10001101          hope       01000111110
fake  11111000001      !         10001100          true       111110010110
true  111110010110     bullshit  11110101011111    searching  01111000010

Table 4: Top 5 Brown clusters, each shown with a representative word.
For further details please see the cluster definitions at
http://www.ark.cs.cmu.edu/TweetNLP/cluster_viewer.html

8 Conclusions

This paper investigated the problem of classifying judgements
expressed in tweets about rumours. First, we considered a setting
where no training data from the target rumour is available (LOO).
Without access to annotated examples of the target rumour the learning
problem becomes very difficult. We showed that in the supervised
domain adaptation setting (LPO) even annotating a small number of
tweets helps to achieve better results. Moreover, we demonstrated the
benefits of a multi-task learning approach, as well as that Brown
cluster features are more useful for the task than simple bag of
words.

Judgement estimation is undoubtedly of great value e.g. for marketing,
politics and journalism, helping to target widely believed topics.
Although the focus here is on classifying community reactions,
Castillo et al. (2013) showed that community reaction is correlated
with actual rumour veracity. Consequently our classification methods
may prove useful in the broader and more challenging task of
annotating veracity.

An interesting direction for future work would be adding non-textual
features. For example, the rumour diffusion pattern
(Lukasik et al., 2015) may be a useful cue for judgement
classification.